{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
    "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
    "4. Run this cell to set up dependencies.\n",
    "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n",
    "\"\"\"\n",
    "# If you're using Google Colab and not running locally, run this cell.\n",
    "\n",
    "## Install dependencies\n",
    "!pip install wget\n",
    "\n",
    "## Install NeMo\n",
    "BRANCH = 'main'\n",
    "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n",
    "\n",
    "\"\"\"\n",
    "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n",
    "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n",
    "that you want to use the \"Run All Cells\" (or similar) option.\n",
    "\"\"\"\n",
    "# exit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import torch\n",
    "import os\n",
    "from nemo.collections.asr.metrics.wer import word_error_rate\n",
    "from nemo.collections.asr.parts.utils.vad_utils import stitch_segmented_asr_output, contruct_manfiest_eval"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Offline ASR+VAD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, we will demonstrate how to use offline VAD to extract speech segments and transcribe the speech segments with CTC models. This will help to exclude some non_speech utterances and could save computation resources by removing unnecessary input to the ASR system. \n",
    "\n",
    "The pipeline includes the following steps.\n",
    "\n",
    "0. [Prepare data and script for demonstration](#Prepare-data-and-script-for-demonstration)\n",
    "1. [Use offline VAD to extract speech segments](#Use-offline-VAD-to-extract-speech-segments)\n",
    "2. [Transcribe speech segments with CTC models](#Transcribe-speech-segments-with-CTC-models)\n",
    "3. [Stitch the prediction text of speech segments](#Stitch-the-prediction-text-of-speech-segments)\n",
    "4. [Evaluate the performance of offline ASR with VAD ](#Evaluate-the-performance-of-offline-VAD-with-ASR)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare data and script for demonstration\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!mkdir -p data\n",
    "!wget -P data/ https://nemo-public.s3.us-east-2.amazonaws.com/chris-sample01_02.wav\n",
    "!wget -P data/ https://nemo-public.s3.us-east-2.amazonaws.com/chris-sample03.wav\n",
    "!wget https://nemo-public.s3.us-east-2.amazonaws.com/chris_demo.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_manifest=\"chris_demo.json\"\n",
    "vad_out_manifest_filepath=\"vad_out.json\"\n",
    "vad_model=\"vad_marblenet\" # here we use vad_marblenet for example, you can choose other VAD models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -n 10 $input_manifest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This cell is mainly for colab. \n",
    "# You can ignore it if run locally but do make sure change the filepaths of scripts and config file in cells below.\n",
    "!mkdir -p scripts\n",
    "if not os.path.exists(\"scripts/vad_infer.py\"):\n",
    "  !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/speech_classification/vad_infer.py\n",
    "if not os.path.exists(\"scripts/transcribe_speech.py\"):\n",
    "  !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/transcribe_speech.py\n",
    "    \n",
    "!mkdir -p conf/VAD\n",
    "if not os.path.exists(\"conf/VAD/vad_inference_postprocessing.yaml\"):\n",
    "    !wget -P conf/VAD/ https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/conf/VAD/vad_inference_postprocessing.yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use offline VAD to extract speech segments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we are using very simple parameters to demonstrate the process. \n",
    "\n",
    "Please choose or tune your own postprocessing parameters. \n",
    "\n",
    "You can find more details in \n",
    "```python \n",
    "<NeMo_git_root>/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb and \n",
    "<NeMo_git_root>/scripts/voice_activity_detection/vad_tune_threshold.py\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The <code>vad_infer.py</code> script will help you generate speech segments. See more details in the script below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# if run locally, vad_infer.py is located in <NeMo_git_root>/examples/asr/speech_classification/vad_infer.py\n",
    "%run -i scripts/vad_infer.py --config-path=\"../conf/VAD\" --config-name=\"vad_inference_postprocessing.yaml\" \\\n",
    "dataset=$input_manifest \\\n",
    "vad.model_path=$vad_model \\\n",
    "frame_out_dir=\"chris_demo\" \\\n",
    "vad.parameters.window_length_in_sec=0.63 \\\n",
    "vad.parameters.postprocessing.onset=0.5 \\\n",
    "vad.parameters.postprocessing.offset=0.5 \\\n",
    "vad.parameters.postprocessing.min_duration_on=0.5 \\\n",
    "vad.parameters.postprocessing.min_duration_off=0.5 \\\n",
    "out_manifest_filepath=$vad_out_manifest_filepath"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at VAD output. If there are no speech segments in the sample. The sample will not appear in VAD output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -n 10 $vad_out_manifest_filepath"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transcribe speech segments with CTC models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "segmented_output_manifest=\"asr_segmented_output_manifest.json\"\n",
    "asr_model=\"stt_en_citrinet_1024_gamma_0_25\" # here we use citrinet for example, you can choose other CTC models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The <code>transcribe_speech.py</code> script will help you transcribe each speech segment. See more details in the script below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# if run locally, transcribe_speech.py is located in <NeMo_git_root>/examples/asr/transcribe_speech.py\n",
    "%run -i scripts/transcribe_speech.py \\\n",
    "    pretrained_name=$asr_model \\\n",
    "    dataset_manifest=$vad_out_manifest_filepath \\\n",
    "    batch_size=32 \\\n",
    "    amp=True \\\n",
    "    output_filename=$segmented_output_manifest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at the segmented ASR transcript."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -n 5 $segmented_output_manifest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stitch the prediction text of speech segments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also evaluate the whole ASR output by stitching the segmented outputs together.\n",
    "\n",
    "Note, there would be a better method to stitch them together. Here, we just demonstrate the simplest method, concatenating."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stitched_output_manifest=\"stitched_asr_output_manifest.json\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stitched_output_manifest = stitch_segmented_asr_output(segmented_output_manifest)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at the stitched output and the stored speech segments of the first sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "stitched_output = []\n",
    "for line in open(stitched_output_manifest, 'r', encoding='utf-8'):\n",
    "    file = json.loads(line)\n",
    "    stitched_output.append(file)\n",
    "\n",
    "print(stitched_output[0])\n",
    "print(f\"\\n The speech segments of above file are \\n {torch.load(stitched_output[0]['speech_segments_filepath'])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluate the performance of offline VAD with ASR "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we have ground-truth <code>'text'</code> in input_manifest, we can evaluate our performance of stitched output. Let's align the <code>'text'</code> in input manifest and <code>'pred_text'</code> in stitched segmented asr output first, since some samples from input_manfiest might be pure noise and have been removed in VAD output and excluded for ASR inference. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aligned_vad_asr_output_manifest = contruct_manfiest_eval(input_manifest, stitched_output_manifest)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -n 10 $aligned_vad_asr_output_manifest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predicted_text, ground_truth_text = [], []\n",
    "for line in open(aligned_vad_asr_output_manifest, 'r', encoding='utf-8'):\n",
    "    sample = json.loads(line)\n",
    "    predicted_text.append(sample['pred_text'])\n",
    "    ground_truth_text.append(sample['text'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metric_value = word_error_rate(hypotheses=predicted_text, references=ground_truth_text, use_cer=False)\n",
    "print(f\"WER is {metric_value}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
